Reading QM outputs - EDA and COVP calculations


In [1]:
from __future__ import print_function
from __future__ import division

In [2]:
import numpy as np

Normally, one would want a very general way of reading in output files (say, an argparse input argument with nargs='+' that gets looped over in a big loop), but this is more to demonstrate the parsing of this specific kind of file, so we use a single hard-coded path:


In [3]:
outputfilepath = "../qm_files/drop_0001_1qm_2mm_eda_covp.out"

Normally, we might also do this, where we read the contents of the entire file in as a string. This might be a bad idea for these files, since they can grow to several megabytes.


In [4]:
# with open(outputfilepath) as outputfile:
#     raw_contents = outputfile.read()

It's more efficient to loop over the file directly, which avoids having to read the whole thing into memory. This does mean that you can't open and close it right away; you add another level of indentation.


In [5]:
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if "Total energy in the final basis" in line:
            print(line, end='')


 Total energy in the final basis set = -1364.5425494842
 Total energy in the final basis set = -189.2277694223
 Total energy in the final basis set = -1553.1339006689

Actually, it might be instructive to do a timing comparison between the two approaches.


In [6]:
searchstr = "Total energy in the final basis"

In [7]:
%%timeit -n 1000 -r 10
counter = 0
with open(outputfilepath) as outputfile:
    raw_contents = outputfile.read()
for line in iter(raw_contents.splitlines()):
    if searchstr in line:
        counter += 1


1000 loops, best of 10: 354 µs per loop

In [8]:
%%timeit -n 1000 -r 10
counter = 0
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            counter += 1


1000 loops, best of 10: 384 µs per loop

It looks like there's a very slight time penalty the second way; it's often the case that memory-efficient approaches trade a bit of CPU time for the lower memory footprint. The second way also looks a little bit cleaner, and it's easier to understand what's going on.

Let's change the string we're looking for to one that's more relevant to the EDA/COVP analysis.


In [9]:
searchstr = "Energy decomposition of the delocalization term, kJ/mol"

In [10]:
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            print(line, end='')


 *      Energy decomposition of the delocalization term, kJ/mol      * 

That's fine, but we also want some of the lines that come after the header.


In [11]:
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            # print 10 lines instead
            for _ in range(10):
                print(line, end='')
                line = next(outputfile)


 *      Energy decomposition of the delocalization term, kJ/mol      * 
 *-------------------------------------------------------------------* 
                DEL from fragment(row) to fragment(col)                
  -------------------------------------------------------------------    
             1           2
    1     0.00000    -9.01048
    2    -6.09647    -0.00000
 ---------------------------------------------------------------------


Here we've used two tricks:

  1. Just because a loop defines a variable (as with range(), zip(), or enumerate()) doesn't mean we need to use it. When the value doesn't matter, you'll often see _ used as the loop variable: something still has to receive the value for the loop to work, but the name signals that it's unused.

  2. Any open file whose handle you have access to (called outputfile in the example above), or anything that can be wrapped with iter() to make it an iterator, can have the next() function called on it to return the next item. In the case of files, you iterate over the lines one by one (separated by newlines). That's why the statement is for line in outputfile:, where outputfile is the iterator and line is the variable that holds the latest item produced by it.

To learn more about iterators, there's the official documentation, and I found this Stack Overflow post: http://stackoverflow.com/questions/9884132/what-exactly-are-pythons-iterator-iterable-and-iteration-protocols
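As a toy illustration of the iterator protocol (not tied to the output file):

```python
# Any iterable can be handed to iter() to get an iterator, and next()
# pulls one item from it at a time.
lines = ["first\n", "second\n", "third\n"]
it = iter(lines)
print(next(it), end='')  # first
print(next(it), end='')  # second
# A for loop keeps calling next() for us until StopIteration is raised,
# picking up wherever the iterator left off.
for line in it:
    print(line, end='')  # third
```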

Usually, we don't specify a set number of extra lines to iterate, because that number isn't fixed. Instead, we parse until we hit some other line that's a good stopping point. Here is the full block we're interested in, plus the start of the other one for some context:

 *-------------------------------------------------------------------*
 *      Energy decomposition of the delocalization term, kJ/mol      *
 *-------------------------------------------------------------------*
                DEL from fragment(row) to fragment(col)
  -------------------------------------------------------------------
             1           2
    1     0.00000    -9.01048
    2    -6.09647    -0.00000
 ---------------------------------------------------------------------


 ---------------------------------------------------------------------
 *                     Charge transfer analysis                      *
 *             R.Z.Khaliullin, A.T. Bell, M.Head-Gordon              *
 *                J. Chem. Phys., 2008, 128, 184112                  *
 *-------------------------------------------------------------------*

The "variable" part of parsing here is the number of rows and columns between the two lines of dashes that come after DEL from.... That's the line we should really be searching for, since it's unique in the output file, and it's closer to the lines we want to extract.

Here's the idea.

  1. Search for the line.
  2. Make sure we skip the line with the dashes.
  3. Make sure we skip the line with the column indices. Important note: We're going to assume that the number of columns won't overflow! This will only work for 5 or fewer fragments.
  4. Parse each row until we hit the closing line of dashes.

In [12]:
searchstr = "DEL from fragment(row) to fragment(col)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            # This is now the line with the dashes.
            line = next(outputfile)
            # This is now the line with the column indices.
            line = next(outputfile)
            # Skip again to get the first line we want to parse.
            line = next(outputfile)
            # This ensures the parsing will terminate once the block is over.
            while list(set(line.strip())) != ['-']:
                print(line, end='')
                line = next(outputfile)


    1     0.00000    -9.01048
    2    -6.09647    -0.00000

Now we're printing the correct rows. How should we store these values? It's probably best to put them in a NumPy array, but since that array needs to be allocated beforehand, we need to know the shape (which is the number of fragments). How do we get that?


In [13]:
searchstr_num_fragments = "SCF on fragment 1 out of"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr_num_fragments in line:
            nfragments = int(line.split()[-1])
print(nfragments)


2

Now, combine the two and place a NumPy array allocation in the middle.

The last tricky bit will be assigning the text/string values to array elements. We're going to use the slicing syntax for both the NumPy array and the string we're splitting.


In [14]:
searchstr_num_fragments = "SCF on fragment 1 out of"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr_num_fragments in line:
            nfragments = int(line.split()[-1])
# create an empty array (where we don't initialize the elements to 0, 1, or anything else)
fragment_del_energies = np.empty(shape=(nfragments, nfragments))
searchstr = "DEL from fragment(row) to fragment(col)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            line = next(outputfile)
            line = next(outputfile)
            line = next(outputfile)
            # We need to keep track of our row index with a counter, because we can't
            # use enumerate with a while loop.
            # We need to keep track of our row index in the first place because we're
            # indexing into a NumPy array.
            row_counter = 0
            while list(set(line.strip())) != ['-']:
                # 'map' float() onto every element of the list
                # map() returns a generator, so turn it back into a list
                sline = list(map(float, line.split()[1:]))
                # set all columns in a given row to the parsed values
                fragment_del_energies[row_counter, :] = sline
                line = next(outputfile)
                row_counter += 1

In [15]:
print(fragment_del_energies)


[[ 0.      -9.01048]
 [-6.09647 -0.     ]]

It's probably a good idea to turn these into functions, so they can be run on an arbitrary calculation.


In [16]:
def get_num_fragments(outputfilepath):
    """Given a path to an output file, figure out how many fragments are part of it.
    """
    searchstr_num_fragments = "SCF on fragment 1 out of"
    with open(outputfilepath) as outputfile:
        for line in outputfile:
            if searchstr_num_fragments in line:
                nfragments = int(line.split()[-1])
                return nfragments

In [17]:
def get_eda_fragment_delocalization_energies(outputfilepath, nfragments):
    """Given a path to an output file and the number of fragments it contains, return the
    delocalization energies between fragments.
    """
    fragment_del_energies = np.empty(shape=(nfragments, nfragments))
    searchstr = "DEL from fragment(row) to fragment(col)"
    with open(outputfilepath) as outputfile:
        for line in outputfile:
            if searchstr in line:
                line = next(outputfile)
                line = next(outputfile)
                line = next(outputfile)
                row_counter = 0
                while list(set(line.strip())) != ['-']:
                    sline = list(map(float, line.split()[1:]))
                    fragment_del_energies[row_counter, :] = sline
                    line = next(outputfile)
                    row_counter += 1
    return fragment_del_energies

Now let's use it:


In [18]:
nfragments = get_num_fragments(outputfilepath)
fragment_del_energies = get_eda_fragment_delocalization_energies(outputfilepath, nfragments)
print(fragment_del_energies)


[[ 0.      -9.01048]
 [-6.09647 -0.     ]]

We can write something almost identical for the decomposition of the charge transfer term, which measures the number of millielectrons that move between fragments:


In [19]:
def get_eda_fragment_delocalization_millielectrons(outputfilepath, nfragments):
    """Given a path to an output file and the number of fragments it contains,
    return the number of millielectrons that delocalize between fragments.
    """
    fragment_del_millielectrons = np.empty(shape=(nfragments, nfragments))
    searchstr = "Delocalization from fragment(row) to fragment(col)"
    with open(outputfilepath) as outputfile:
        for line in outputfile:
            if searchstr in line:
                line = next(outputfile)
                line = next(outputfile)
                line = next(outputfile)
                row_counter = 0
                while list(set(line.strip())) != ['-']:
                    sline = list(map(float, line.split()[1:]))
                    fragment_del_millielectrons[row_counter, :] = sline
                    line = next(outputfile)
                    row_counter += 1
    return fragment_del_millielectrons

In [20]:
fragment_del_millielectrons = get_eda_fragment_delocalization_millielectrons(outputfilepath, nfragments)
print(fragment_del_millielectrons)


[[-0.13754  4.83685]
 [ 3.7973  -0.10332]]

The easier we make it to reuse our code for new calculations, the faster we get to analysis and thinking about our data.

Since we're "delocalizing" from row to column, we should be able to get the total number of millielectrons donated by each fragment as the sum over all columns for each row. To get the total number of millielectrons accepted by a fragment, we can take the sum over all rows for a given column.

For this particular calculation, fragment 1 is a combined anion/cation ionic liquid pair, and fragment 2 is CO$_2$. Knowing this, we probably expect more charge to shift from the ionic liquid onto the CO$_2$, though it's hard to say that conclusively since the anion can just delocalize onto the cation (the whole fragment is of course charge neutral). So, it shouldn't be too surprising if the numbers aren't very different.


In [21]:
me_donated_by_il = np.sum(fragment_del_millielectrons[0, :])
me_donated_by_co2 = np.sum(fragment_del_millielectrons[1, :])
print(me_donated_by_il, me_donated_by_co2)


4.69931 3.69398

There's a net donation of charge density from the ionic liquid onto the CO$_2$, as expected.

What about charge accepted?


In [22]:
me_accepted_by_il = np.sum(fragment_del_millielectrons[:, 0])
me_accepted_by_co2 = np.sum(fragment_del_millielectrons[:, 1])
print(me_accepted_by_il, me_accepted_by_co2)


3.65976 4.73353

The values are almost exactly the opposite of the charge donation values. Why aren't they exactly the same?
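One way to probe this is a quick check with NumPy (a sketch using the values printed above; the hypothesis is that the diagonal, intrafragment entries are the culprit):

```python
import numpy as np

# Values copied from the fragment_del_millielectrons output above.
m = np.array([[-0.13754, 4.83685],
              [ 3.7973, -0.10332]])

# The diagonal entries are intrafragment terms, and they're included in
# both the row sums and the column sums. Zero them out and compare again.
off_diag = m - np.diag(np.diag(m))
donated = np.sum(off_diag, axis=1)   # per-fragment donation, diagonal excluded
accepted = np.sum(off_diag, axis=0)  # per-fragment acceptance, diagonal excluded
print(donated, accepted)
```

With the diagonal removed, everything fragment 1 donates is exactly what fragment 2 accepts, and vice versa, so the mismatch comes entirely from the diagonal terms.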

Parsing the COVP section

There's an additional section of output that can be requested when performing calculations with only two fragments: complementary occupied-virtual pairs (COVPs) can be formed, which allow a direct assignment of each donor orbital on one fragment to an acceptor orbital on the other. The amount of charge transferred in each direction is calculated in terms of both energy and millielectrons.

 ---------------------------------------------------------------------
 *               Complementary occupied-virtual pairs                *
 *                  Delta E, kJ/mol; Delta Q, me-                    *
 *                        No BSSE correction                         *
 ---------------------------------------------------------------------
   From fragment 1 to fragment 2
 ---------------------------------------------------------------------
   #   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)
 ---------------------------------------------------------------------
   1  -3.1119( 69.1%)  -3.1119( 69.1%)   1.805( 74.7%)   1.805( 74.7%)
   2  -0.9232( 20.5%)  -0.9232( 20.5%)   0.415( 17.2%)   0.415( 17.2%)
   3  -0.2344(  5.2%)  -0.2344(  5.2%)   0.119(  4.9%)   0.119(  4.9%)
   4  -0.0771(  1.7%)  -0.0771(  1.7%)   0.034(  1.4%)   0.034(  1.4%)
   5  -0.0536(  1.2%)  -0.0536(  1.2%)   0.016(  0.7%)   0.016(  0.7%)
   6  -0.0324(  0.7%)  -0.0324(  0.7%)   0.010(  0.4%)   0.010(  0.4%)
   7  -0.0245(  0.5%)  -0.0245(  0.5%)   0.009(  0.4%)   0.009(  0.4%)
   8  -0.0197(  0.4%)  -0.0197(  0.4%)   0.005(  0.2%)   0.005(  0.2%)
   9  -0.0111(  0.2%)  -0.0111(  0.2%)   0.003(  0.1%)   0.003(  0.1%)
  10  -0.0104(  0.2%)  -0.0104(  0.2%)   0.002(  0.1%)   0.002(  0.1%)
  11  -0.0023(  0.1%)  -0.0023(  0.1%)   0.001(  0.0%)   0.001(  0.0%)
  12  -0.0011(  0.0%)  -0.0011(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  13  -0.0011(  0.0%)  -0.0011(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  14  -0.0009(  0.0%)  -0.0009(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  15  -0.0005(  0.0%)  -0.0005(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  16  -0.0004(  0.0%)  -0.0004(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  17  -0.0003(  0.0%)  -0.0003(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  18  -0.0001(  0.0%)  -0.0001(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  19  -0.0001(  0.0%)  -0.0001(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  20  -0.0001(  0.0%)  -0.0001(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  21  -0.0001(  0.0%)  -0.0001(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  22  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  23  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  24  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  25  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  26  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  27  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  28  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  29  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  30  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  31  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  32  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  33  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  34  -0.0000(  0.0%)  -0.0000(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
 ---------------------------------------------------------------------
 Tot  -4.5052(100.0%)  -4.5052(100.0%)   2.418(100.0%)   2.418(100.0%)
 ---------------------------------------------------------------------
   From fragment 2 to fragment 1
 ---------------------------------------------------------------------
   #   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)
 ---------------------------------------------------------------------
   1  -2.2084( 72.4%)  -2.2084( 72.4%)   1.532( 80.7%)   1.532( 80.7%)
   2  -0.3802( 12.5%)  -0.3802( 12.5%)   0.182(  9.6%)   0.182(  9.6%)
   3  -0.2128(  7.0%)  -0.2128(  7.0%)   0.082(  4.3%)   0.082(  4.3%)
   4  -0.1511(  5.0%)  -0.1511(  5.0%)   0.070(  3.7%)   0.070(  3.7%)
   5  -0.0526(  1.7%)  -0.0526(  1.7%)   0.020(  1.1%)   0.020(  1.1%)
   6  -0.0337(  1.1%)  -0.0337(  1.1%)   0.010(  0.5%)   0.010(  0.5%)
   7  -0.0053(  0.2%)  -0.0053(  0.2%)   0.001(  0.0%)   0.001(  0.0%)
   8  -0.0027(  0.1%)  -0.0027(  0.1%)   0.000(  0.0%)   0.000(  0.0%)
   9  -0.0011(  0.0%)  -0.0011(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  10  -0.0003(  0.0%)  -0.0003(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
  11  -0.0002(  0.0%)  -0.0002(  0.0%)   0.000(  0.0%)   0.000(  0.0%)
 ---------------------------------------------------------------------
 Tot  -3.0482(100.0%)  -3.0482(100.0%)   1.899(100.0%)   1.899(100.0%)
 ---------------------------------------------------------------------

The most interesting values are the totals from each fragment to the other. Both the energy and the number of millielectrons would be good to have. There are two columns for each, one for alpha spin and one for beta; since we're using a spin-restricted wavefunction, they're identical, and we only need one of them.

It's been determined that the "target" lines containing the numbers we want are

Tot  -4.5052(100.0%)  -4.5052(100.0%)   2.418(100.0%)   2.418(100.0%)
Tot  -3.0482(100.0%)  -3.0482(100.0%)   1.899(100.0%)   1.899(100.0%)

but really just

(-4.5052, 2.418)
(-3.0482, 1.899)

so what text can we search for? # Delta E(Alpha) Delta E(Beta) Delta Q(Alpha) Delta Q(Beta) is a good choice; it isn't unique (it appears once per direction), but it only appears inside this block, and it clearly starts each section. We can also search for Tot.


In [23]:
searchstr = "#   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            print(line, end='')


   #   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)
   #   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)

Now use the trick where we stick a while loop inside the if statement and call the outputfile iterator until we hit Tot:


In [24]:
searchstr = "#   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            # Do an exact character match on the string.
            while line[:4] != " Tot":
                line = next(outputfile)
            print(line, end='')


 Tot  -4.5052(100.0%)  -4.5052(100.0%)   2.418(100.0%)   2.418(100.0%)
 Tot  -3.0482(100.0%)  -3.0482(100.0%)   1.899(100.0%)   1.899(100.0%)

All that each line requires is a bit of manipulation: split, take the [1::2] entries (quiz: what does this do?), get rid of the percentage stuff, map the values to floats, and return them as tuples. There's a problem though: how can we uniquely return both tuples? We could append every match to a list and return the list, but I'd rather be more explicit here since we're only dealing with two lines.
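To see the pieces in isolation, here's the manipulation applied to the first Tot line (spoiler for the quiz: [1::2] takes every other element starting at index 1):

```python
line = " Tot  -4.5052(100.0%)  -4.5052(100.0%)   2.418(100.0%)   2.418(100.0%)"
tokens = line.split()
# [1::2] starts at index 1 and takes every other token, i.e. just the
# alpha-spin columns: ['-4.5052(100.0%)', '2.418(100.0%)']
alpha = tokens[1::2]
# "(100.0%)" is 8 characters, so x[:-8] strips the percentage part.
values = tuple(float(x[:-8]) for x in alpha)
print(values)  # (-4.5052, 2.418)
```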


In [25]:
searchstr = "#   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            while line[:4] != " Tot":
                line = next(outputfile)
            print(line, end='')
            line = next(outputfile)
            while line[:4] != " Tot":
                line = next(outputfile)
            print(line, end='')


 Tot  -4.5052(100.0%)  -4.5052(100.0%)   2.418(100.0%)   2.418(100.0%)
 Tot  -3.0482(100.0%)  -3.0482(100.0%)   1.899(100.0%)   1.899(100.0%)

This isn't a good idea for more complicated cases (for example, it won't work if Tot is on two consecutive lines), but it works more often than not.

The lines that we just print to the screen can now be manipulated and assigned to unique variables:


In [26]:
searchstr = "#   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            while line[:4] != " Tot":
                line = next(outputfile)
            f_1_to_2 = tuple([float(x[:-8]) for x in line.split()[1::2]])
            print(f_1_to_2)
            line = next(outputfile)
            while line[:4] != " Tot":
                line = next(outputfile)
            f_2_to_1 = tuple([float(x[:-8]) for x in line.split()[1::2]])
            print(f_2_to_1)


(-4.5052, 2.418)
(-3.0482, 1.899)

Notice that the list(map(float, line.split())) trick can't be used here, because we aren't just doing a type conversion on each element, but also a slicing operation. We could do the slicing operation with map() and an anonymous function, but it doesn't look as nice:


In [27]:
searchstr = "#   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)"
with open(outputfilepath) as outputfile:
    for line in outputfile:
        if searchstr in line:
            while line[:4] != " Tot":
                line = next(outputfile)
            f_1_to_2 = tuple(map(lambda x: float(x[:-8]), line.split()[1::2]))
            print(f_1_to_2)
            line = next(outputfile)
            while line[:4] != " Tot":
                line = next(outputfile)
            f_2_to_1 = tuple(map(lambda x: float(x[:-8]), line.split()[1::2]))
            print(f_2_to_1)


(-4.5052, 2.418)
(-3.0482, 1.899)

Maybe it looks fine; if you've never used an anonymous function before, it can look a bit odd. I tend to write the former, with the explicit list comprehension.

Now turn it into a function:


In [28]:
def get_eda_covp_totals(outputfilepath):
    """Given a path to an output file, return the totals for each fragment from the COVP analysis.
    The first element of the tuple is the energy contribution, the second element is the
    number of millielectrons transferred."""
    searchstr = "#   Delta E(Alpha)    Delta E(Beta)  Delta Q(Alpha)   Delta Q(Beta)"
    with open(outputfilepath) as outputfile:
        for line in outputfile:
            if searchstr in line:
                while line[:4] != " Tot":
                    line = next(outputfile)
                f_1_to_2 = tuple(map(lambda x: float(x[:-8]), line.split()[1::2]))
                line = next(outputfile)
                while line[:4] != " Tot":
                    line = next(outputfile)
                f_2_to_1 = tuple(map(lambda x: float(x[:-8]), line.split()[1::2]))
    return f_1_to_2, f_2_to_1

In [29]:
f_1_to_2, f_2_to_1 = get_eda_covp_totals(outputfilepath)
print(f_1_to_2)
print(f_2_to_1)


(-4.5052, 2.418)
(-3.0482, 1.899)

Summary

What are some of the advantages of these approaches?

  1. Keep as much stuff inside functions as possible. If a function takes more than a handful of arguments, it's either too big or just shouldn't be a function. These only take one or two arguments.
  2. There are quite a few loops, most of them nested, which can be a problem, but the control flow for finding specific lines in files is very explicit. This is a better alternative to reading in the file, splitting the whole thing on newlines, wrapping in an iterator, calling next() a fixed number of times, etc.
  3. A balance between being very specific and very general. This mostly comes from practice and experience.